SimpleQA

A factuality benchmark measuring whether language models can answer short, fact-seeking questions — and know when they don’t know the answer

Published

September 12, 2025

Keywords: SimpleQA, factuality benchmark, hallucination evaluation, fact-seeking QA, short-form factuality, language model calibration, OpenAI benchmark, correct incorrect not attempted, LLM trustworthiness, GPT-4o, o1-preview, Claude, knowledge grounding

Introduction

Language models hallucinate. They produce confident-sounding answers to questions they cannot reliably answer, and distinguishing fact from fabrication remains one of the hardest open problems in AI. Existing factuality benchmarks like TriviaQA (2017) and Natural Questions (2019) have become saturated — frontier models score above 90% — leaving little room to measure progress.

SimpleQA tackles this directly. Created by OpenAI, it is a benchmark of 4,326 short, fact-seeking questions where every answer is a single, indisputable fact verified by two independent human annotators. Each model response is graded as correct, incorrect, or not attempted — making it possible to measure not just accuracy but also whether a model knows what it knows.

“SimpleQA is a simple, targeted evaluation for whether models ‘know what they know,’ and our hope is that this benchmark will remain relevant for the next few generations of frontier models.” — Jason Wei et al., SimpleQA Paper

graph LR
    A["Older Benchmarks<br/>TriviaQA · NQ<br/>Saturated >90%"] --> B["Hallucination<br/>Problem Persists"]
    B --> C["SimpleQA<br/>4,326 fact-seeking Q&A<br/>Adversarially collected"]
    C --> D["Measures factuality<br/>+ calibration of<br/>frontier LLMs"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Does SimpleQA Measure?

SimpleQA evaluates short-form factual accuracy — can a model answer a specific knowledge question correctly, and does it refrain from answering when it doesn’t know? The benchmark was designed with four key properties:

| Property | Description |
| --- | --- |
| High Correctness | Each question verified by 2 independent AI trainers; estimated ~3% error rate |
| Challenging | Adversarially collected against GPT-4: at least one of four GPT-4 completions must fail |
| Diverse | Covers science, politics, art, geography, TV shows, video games, and more |
| Simple to Run | Short questions and answers; grading via a single ChatGPT classifier call |

Grading System

Every model completion is classified into exactly one of three grades:

| Grade | Definition | Example |
| --- | --- | --- |
| Correct | Predicted answer fully contains the reference answer without contradiction | “Wout Weghorst” |
| Incorrect | Predicted answer contradicts the reference answer in any way | “Virgil van Dijk” |
| Not Attempted | Reference answer is not given and no contradiction exists | “I don’t know” |
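In the official harness this three-way decision is made by a prompted ChatGPT classifier. The contract it implements can be sketched with a deliberately simplified string heuristic; the decline-phrase list below is an illustrative assumption, not the real grader prompt:

```python
def grade_response(predicted: str, reference: str) -> str:
    """Simplified three-way SimpleQA grading sketch.

    The real grader is a ChatGPT classifier given the question, reference
    answer, and prediction; this string heuristic only illustrates the
    correct / incorrect / not_attempted contract.
    """
    pred = predicted.strip().lower()
    ref = reference.strip().lower()
    # Phrases treated as declining to answer (assumed, illustrative list).
    declines = ("i don't know", "i do not know", "not sure", "cannot answer")
    if any(d in pred for d in declines):
        return "not_attempted"
    if ref in pred:  # prediction fully contains the reference answer
        return "correct"
    return "incorrect"  # an answer was given but does not match the reference
```

For example, `grade_response("Wout Weghorst scored twice", "Wout Weghorst")` is graded correct, while `"Virgil van Dijk"` against the same reference is incorrect.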

Metrics

SimpleQA reports three key metrics:

  • Correct (%): Percentage of all questions answered correctly — measures recall
  • Correct Given Attempted (%): Of questions the model attempted, what percentage were correct — measures precision
  • F-score: Harmonic mean of Correct and Correct Given Attempted
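All three metrics follow directly from the grade counts; a minimal sketch:

```python
def simpleqa_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Compute the three SimpleQA metrics from raw grade counts."""
    total = correct + incorrect + not_attempted
    correct_pct = 100 * correct / total
    attempted = correct + incorrect
    cga_pct = 100 * correct / attempted if attempted else 0.0
    # F-score: harmonic mean of Correct % and Correct Given Attempted %.
    denom = correct_pct + cga_pct
    f_score = 2 * correct_pct * cga_pct / denom if denom else 0.0
    return {"correct": correct_pct,
            "correct_given_attempted": cga_pct,
            "f_score": f_score}
```

Plugging in GPT-4o's row from the paper (38.2% correct, 60.8% incorrect, 1.0% not attempted, scaled to counts) reproduces its 38.6 Correct Given Attempted and 38.4 F-score.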

graph TD
    A["Model Response"] --> B{"ChatGPT<br/>Grader"}
    B -->|"Fully contains<br/>reference answer"| C["✅ Correct<br/>+1 point"]
    B -->|"Contradicts<br/>reference answer"| D["❌ Incorrect<br/>−p penalty"]
    B -->|"Doesn't attempt<br/>to answer"| E["⬜ Not Attempted<br/>0 points"]
    C --> F["Correct %<br/>= correct / total"]
    D --> F
    E --> F
    C --> G["Correct Given Attempted<br/>= correct / (correct + incorrect)"]
    D --> G

    style A fill:#9b59b6,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#e74c3c,stroke:#333,color:#fff
    style E fill:#95a5a6,stroke:#333,color:#fff
    style F fill:#3498db,stroke:#333,color:#fff
    style G fill:#3498db,stroke:#333,color:#fff

Who Is Behind SimpleQA?

SimpleQA was created at OpenAI by:

  • Jason Wei (lead author), Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus

The paper “Measuring short-form factuality in large language models” was published on October 30, 2024 (blog) and November 7, 2024 (arXiv: 2411.04368).

What Skills Does It Test?

graph LR
    subgraph "Question Topics (4,326 Q&A)"
        A["Science & Tech<br/>858 questions"] 
        B["Politics<br/>709 questions"]
        C["Art<br/>550 questions"]
        D["Geography · History<br/>TV · Sports · Games"]
    end
    subgraph "Answer Types"
        E["Dates 32.8%"]
        F["Persons 24.1%"]
        G["Numbers 15.3%"]
        H["Places 9.9%"]
        I["Other 18.0%"]
    end

    style A fill:#3498db,stroke:#333,color:#fff
    style B fill:#e74c3c,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#f39c12,stroke:#333,color:#fff
    style E fill:#9b59b6,stroke:#333,color:#fff
    style F fill:#9b59b6,stroke:#333,color:#fff
    style G fill:#9b59b6,stroke:#333,color:#fff
    style H fill:#9b59b6,stroke:#333,color:#fff
    style I fill:#9b59b6,stroke:#333,color:#fff

The dataset is adversarially curated: questions had to make at least one GPT-4 variant produce an incorrect answer. Every question was independently verified by a second annotator, and only questions where both annotators agreed on the answer were kept. A third annotator cross-checked 1,000 random samples, confirming a 94.4% agreement rate with the original answers.

Dashboard — SimpleQA Leaderboard

OpenAI Models — Detailed Breakdown (from Paper)

Results from the original SimpleQA paper (November 2024):

| Model | Correct (%) | Not Attempted (%) | Incorrect (%) | Correct Given Attempted (%) | F-score |
| --- | --- | --- | --- | --- | --- |
| o1-preview | 42.7 | 9.2 | 48.1 | 47.0 | 44.8 |
| GPT-4o | 38.2 | 1.0 | 60.8 | 38.6 | 38.4 |
| Claude 3.5 Sonnet | 28.9 | 35.0 | 36.1 | 44.5 | 35.0 |
| Claude 3 Opus | 23.5 | 39.6 | 36.9 | 38.8 | 29.3 |
| GPT-4o-mini | 8.6 | 0.9 | 90.5 | 8.7 | 8.6 |
| o1-mini | 8.1 | 28.5 | 63.4 | 11.3 | 9.4 |
| Claude 3 Sonnet | 5.7 | 75.0 | 19.3 | 22.9 | 9.2 |
| Claude 3 Haiku | 5.1 | 75.3 | 19.6 | 20.6 | 8.2 |

Source: arXiv:2411.04368, Table 3 (November 7, 2024)

Extended Leaderboard — SimpleQA Correct (%)

Results from the OpenAI simple-evals repository, showing the “Correct %” metric across all evaluated models:

| Rank | Model | SimpleQA Correct (%) |
| --- | --- | --- |
| 1 | GPT-4.5 Preview | 62.5 |
| 2 | o3 | 49.4 |
| 3 | o3-low | 49.4 |
| 4 | o3-high | 48.6 |
| 5 | o1 | 42.6 |
| 6 | o1-preview | 42.4 |
| 7 | GPT-4.1 | 41.6 |
| 8 | GPT-4o (2024-08-06) | 40.1 |
| 9 | GPT-4o (2024-05-13) | 39.0 |
| 10 | GPT-4o (2024-11-20) | 38.8 |
| 11 | Claude 3.5 Sonnet | 28.9 |
| 12 | GPT-4 Turbo | 24.2 |
| 13 | Claude 3 Opus | 23.5 |
| 14 | o4-mini | 20.2 |
| 15 | o4-mini-low | 20.2 |
| 16 | o4-mini-high | 19.3 |
| 17 | GPT-4.1 Mini | 16.8 |
| 18 | o3-mini-high | 13.8 |
| 19 | o3-mini | 13.4 |
| 20 | o3-mini-low | 13.0 |
| 21 | GPT-4o-mini | 9.5 |
| 22 | o1-mini | 7.6 |
| 23 | GPT-4.1 Nano | 7.6 |

Source: github.com/openai/simple-evals, consulted March 29, 2026

Key Insights from the Results

  1. GPT-4.5 Preview dominates at 62.5% — OpenAI’s most knowledge-dense model, designed to prioritize breadth of world knowledge
  2. Reasoning models (o3, o1) score well but not as high as GPT-4.5, suggesting factual recall ≠ reasoning ability
  3. Small models struggle badly — GPT-4o-mini at 9.5%, o1-mini at 7.6%, GPT-4.1 Nano at 7.6%
  4. Claude models are conservative — Claude 3 Haiku and Sonnet declined (“not attempted”) on roughly 75% of questions, which keeps their incorrect rate low but also caps their correct rate
  5. GPT-4o-mini is overconfident — only 0.9% not attempted but 90.5% incorrect, showing extreme hallucination tendency on hard factual questions

graph TD
    A["SimpleQA Results<br/>Key Patterns"] --> B["Large Models<br/>More Factual<br/>GPT-4.5: 62.5%"]
    A --> C["Reasoning ≠ Facts<br/>o3: 49% vs<br/>GPT-4.5: 62.5%"]
    A --> D["Small Models<br/>Hallucinate More<br/>GPT-4o-mini: 9.5%"]
    A --> E["Calibration Varies<br/>Claude: cautious<br/>GPT-4o-mini: overconfident"]

    style A fill:#2c3e50,stroke:#333,color:#fff
    style B fill:#27ae60,stroke:#333,color:#fff
    style C fill:#3498db,stroke:#333,color:#fff
    style D fill:#e74c3c,stroke:#333,color:#fff
    style E fill:#f39c12,stroke:#333,color:#fff

Calibration — Do Models Know What They Know?

One of SimpleQA’s most valuable contributions is measuring calibration — whether a model’s stated confidence correlates with its actual accuracy. The paper found:

  • Larger models are better calibrated — o1-preview and GPT-4o outperform their mini variants
  • All models overstate confidence — stated confidence consistently exceeds actual accuracy
  • Frequency-based calibration works — when asked the same question 100 times, the most-frequent answer’s frequency correlates with its correctness
  • o1-preview is most calibrated — its answer frequency roughly matches its accuracy

This means SimpleQA measures two things: (1) what a model knows, and (2) whether it knows what it knows.
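The frequency-based method can be sketched as follows; `sampled_answers` stands in for ~100 repeated completions of the same question (the model call itself is omitted and would be supplied by the caller):

```python
from collections import Counter

def frequency_confidence(sampled_answers: list[str]) -> tuple[str, float]:
    """Frequency-based calibration sketch.

    Sample the same question many times; return the modal answer and the
    fraction of samples that produced it, used as a confidence estimate.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(sampled_answers)
```

If 73 of 100 samples say “1912”, the confidence estimate is 0.73; a well-calibrated model should then be right on roughly 73% of such answers.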

Data Collection Pipeline

graph TD
    A["AI Trainer #1<br/>Creates question + answer<br/>with web source"] --> B["ChatGPT Classifiers<br/>Check criteria violations<br/>(ambiguous, temporal, etc.)"]
    B --> C["AI Trainer #2<br/>Independently answers<br/>without seeing original"]
    C --> D{"Both trainers<br/>agree?"}
    D -->|No| E["❌ Removed"]
    D -->|Yes| F["Kept in Dataset"]
    F --> G["Quality Filters<br/>2+ unique source domains<br/>+ timeless + single answer"]
    G --> H["Final Dataset<br/>4,326 questions"]
    H --> I["Trainer #3 Spot Check<br/>1,000 samples → 94.4% agreement"]

    style A fill:#3498db,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#3498db,stroke:#333,color:#fff
    style D fill:#9b59b6,stroke:#333,color:#fff
    style E fill:#e74c3c,stroke:#333,color:#fff
    style F fill:#27ae60,stroke:#333,color:#fff
    style G fill:#f39c12,stroke:#333,color:#fff
    style H fill:#27ae60,stroke:#333,color:#fff
    style I fill:#2c3e50,stroke:#333,color:#fff

Key requirements for every question:

  • Single indisputable answer — “which city” not just “where”
  • Answer must not change over time — no “who is the current president” style questions
  • Must be challenging — at least one GPT-4 completion must be incorrect
  • Answerable as of December 31, 2023 — to fairly evaluate all models
  • Supported by evidence — reference answers backed by web sources from both annotators
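The two-annotator agreement gate can be sketched as a normalized string comparison; the real pipeline compared human-written answers and may have matched more leniently, so this is an illustrative assumption:

```python
def keep_question(answer_1: str, answer_2: str) -> bool:
    """Agreement filter sketch: a question survives only if both trainers'
    independently sourced answers match after whitespace/case normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(answer_1) == norm(answer_2)
```

Questions failing this check were removed from the dataset rather than adjudicated.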

Where to Explore SimpleQA

| Resource | Link |
| --- | --- |
| arXiv Paper | arxiv.org/abs/2411.04368 |
| OpenAI Blog Post | openai.com/index/introducing-simpleqa |
| GitHub (simple-evals) | github.com/openai/simple-evals |
| HuggingFace Dataset | huggingface.co/datasets/openai/SimpleQA |
| License | MIT License |

Watch the Video

Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀

References

  1. Wei, J., Nguyen, K., Chung, H.W., Jiao, Y.J., Papay, S., Glaese, A., Schulman, J., & Fedus, W. (2024). Measuring short-form factuality in large language models. arXiv:2411.04368.
  2. OpenAI. (2024). Introducing SimpleQA. openai.com/index/introducing-simpleqa.
  3. OpenAI. (2024). simple-evals: A lightweight library for evaluating language models. github.com/openai/simple-evals.
  4. Joshi, M., Choi, E., Weld, D., & Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL 2017.
  5. Kwiatkowski, T. et al. (2019). Natural Questions: A Benchmark for Question Answering Research. TACL 2019.
  6. Kadavath, S. et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.

Read More

  • Humanity’s Last Exam — the ultimate frontier benchmark across 100+ academic disciplines
  • GPQA Diamond — graduate-level science questions that challenge expert reasoning
  • MMMLU — massively multilingual multitask language understanding
  • MMMU-Pro — multimodal understanding pushing beyond text-only evaluation
  • OpenAI MRCR — multi-round coreference resolution for long-context reliability